Basic theory of
Feed-forward Neural Networks
Claudio Mirabello
Resources
MIT lectures on Deep Learning
(http://introtodeeplearning.com/)
TensorFlow Playground
(https://playground.tensorflow.org)
Keras Docs
(https://keras.io)
This is a neuron
wikipedia.org
This is a perceptron (1958)
Mark I Perceptron machine
wikipedia.org
MIT “Intro to Deep Learning”
Perceptrons caused excitement
"the embryo of an electronic computer that [the Navy] expects
will be able to walk, talk, see, write, reproduce itself and be
conscious of its existence."
The New York Times
Perceptrons can only learn
linearly separable classes
Minsky and Papert, 1969
Perceptrons can only learn
linearly separable classes
TensorFlow Playground
But sometimes you want
to model non-linear functions
How do we make this non-linear then?
Two ingredients to add
1: Differentiable, non-linear activation functions
Common activation functions
Special case: softmax
Used in classification problems
Given k classes, it estimates which one is more likely
One output per class, each output is assigned a probability from 0 to 1
The sum of probabilities over all outputs is 1
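As a sketch (not from the slides), softmax can be written in a few lines of NumPy; the `scores` vector is a made-up example with one raw score per class:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up raw scores, one per class
probs = softmax(scores)
# Each entry is a probability in (0, 1), and they sum to 1
```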
Wait a second, the perceptron already has a
non-linear (step) activation function!
This is a perceptron
2: Multi-layer Perceptron (1986)
Now we're getting somewhere
Why stop at one hidden layer?
Deep Networks are simply
NNs with multiple hidden layers
Deeper
this way
https://playground.tensorflow.org
Let's review:
- Perceptron
- XOR problem
- Activations
- Multi-layer perceptron
How do we decide
which weights are optimal?
A linear regressor's weights (coefficients)
are calculated in closed form
This can't be done if you have hidden layers
and non-linear activations
How do we decide which weights are optimal?
Lower loss => better predictions
Minimizing loss
Gradient descent
Activation functions have to be differentiable!
The learning rate η
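As a toy illustration (not from the slides), gradient descent repeatedly applies w ← w − η · ∂J/∂w; here on a made-up 1-D quadratic loss with a made-up learning rate:

```python
# Toy gradient descent on a 1-D loss J(w) = (w - 3)**2,
# whose minimum is at w = 3 (values are made up for illustration).
def grad(w):
    return 2 * (w - 3)  # dJ/dw

w = 0.0    # initial weight
eta = 0.1  # learning rate η
for _ in range(100):
    w = w - eta * grad(w)  # w ← w − η * dJ/dw
# After enough steps, w is very close to the minimum at 3
```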
Backpropagation
Backpropagation example, step by step:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Practical example
Simple network:
- Two inputs [x1, x2]
- Two weights [w1, w2]
- No bias
- Activation function g( )
- One output ŷ
- One label y
- Loss function 𝓛( )
- Weight-dependent error J(W)
[Diagram: x1, x2 feed through weights w1, w2 into Σ, then activation g gives ŷ; ŷ is compared with label y by loss 𝓛 to give error J]
1. Forward pass
x1 = 0.50
x2 = 0.51
w1 = 0.35
w2 = 0.40
Σ = ?
ŷ = g(Σ) = 1/(1+e^(-Σ)) = ?
1. Forward pass
x1 = 0.50
x2 = 0.51
w1 = 0.35
w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.35 * 0.50 + 0.40 * 0.51 ≈ 0.38
ŷ = g(Σ) = 1/(1+e^(-Σ)) = 0.59
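The forward pass can be checked with a few lines of Python (a sketch using the slide's numbers):

```python
import math

# Forward pass: sigmoid activation, no bias
x1, x2 = 0.50, 0.51
w1, w2 = 0.35, 0.40

s = w1 * x1 + w2 * x2           # weighted sum Σ = 0.379
y_hat = 1 / (1 + math.exp(-s))  # ŷ = g(Σ), the sigmoid of Σ
print(round(s, 2), round(y_hat, 2))  # 0.38 0.59
```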
2. Calculate Loss
x1 = 0.50
x2 = 0.51
w1 = 0.35
w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.35 * 0.50 + 0.40 * 0.51 ≈ 0.38
ŷ = g(Σ) = 1/(1+e^(-Σ)) = 0.59
y = 0.10
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)² = ?
2. Calculate Loss
x1 = 0.50
x2 = 0.51
w1 = 0.35
w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.35 * 0.50 + 0.40 * 0.51 ≈ 0.38
ŷ = g(Σ) = 1/(1+e^(-Σ)) = 0.59
y = 0.10
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)² = 0.12
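The loss value can be verified the same way (a sketch using the slide's numbers):

```python
# Squared-error loss from the slide: J = ½ * (y − ŷ)²
y, y_hat = 0.10, 0.59
J = 0.5 * (y - y_hat) ** 2
print(round(J, 2))  # 0.12
```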
3. Backpropagate the error
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ (loss) * ∂ŷ/∂Σ (activation) * ∂Σ/∂w1 (weight)
[Diagram: derivatives calculated in order, from the loss 𝓛 backwards to w1]
3. Backpropagate the error: loss
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ (loss) * ∂ŷ/∂Σ (activation) * ∂Σ/∂w1 (weight)
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)²
∂J(W)/∂ŷ = ∂𝓛(y, ŷ)/∂ŷ = ?
3. Backpropagate the error: loss
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ (loss) * ∂ŷ/∂Σ (activation) * ∂Σ/∂w1 (weight)
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)²
∂J(W)/∂ŷ = 2 * ½ * (y – ŷ) * (-1) = -y + ŷ = -0.10 + 0.59 = 0.49
3. Backpropagate the error: activation
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * ∂ŷ/∂Σ (activation) * ∂Σ/∂w1 (weight)
ŷ = g(Σ) = 1/(1+e^(-Σ))
∂ŷ/∂Σ = ?
3. Backpropagate the error: activation
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * ∂ŷ/∂Σ (activation) * ∂Σ/∂w1 (weight)
ŷ = g(Σ) = 1/(1+e^(-Σ))
∂ŷ/∂Σ = 1/(1+e^(-Σ)) * (1 – 1/(1+e^(-Σ))) = 0.24
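A quick numeric check of the sigmoid derivative at Σ = 0.38 (a sketch; the exact value depends on rounding):

```python
import math

# Sigmoid derivative: ∂ŷ/∂Σ = g(Σ) * (1 − g(Σ))
def sigmoid(s):
    return 1 / (1 + math.exp(-s))

g = sigmoid(0.38)
d = g * (1 - g)
print(round(d, 2))  # 0.24
```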
3. Backpropagate the error: weight
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * 0.24 (activation) * ∂Σ/∂w1 (weight)
Σ = X * W = x1 * w1 + x2 * w2
∂Σ/∂w1 = ?
3. Backpropagate the error: weight
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * 0.24 (activation) * ∂Σ/∂w1 (weight)
Σ = X * W = x1 * w1 + x2 * w2
∂Σ/∂w1 = x1 + 0 = 0.50
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * 0.24 (activation) * 0.50 (weight) = 0.06 (gradient)
w1’ = ?
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * 0.24 (activation) * 0.50 (weight) = 0.06 (gradient)
w1’ = w1 – η * 0.06 = 0.35 – 0.06 = 0.29 (with η = 1)
Exercise
Can you calculate the weight update for w2? How many new
gradients do you need to calculate?
What is the new predicted output? Has the error gone down?
What if I had another layer before this one?
[Diagram: after the update, w1 = 0.35 → 0.29 and w2 = 0.40 → 0.34, giving Σ = 0.38 → 0.32 and ŷ = 0.59 → 0.58]
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 (loss) * 0.24 (activation) * 0.50 (weight) = 0.06 (gradient)
∂J(W)/∂w2 = 0.49 (loss) * 0.24 (activation) * 0.51 (weight) = 0.06 (gradient)
w1’ = w1 – η * 0.06 = 0.35 – 0.06 = 0.29 (with η = 1)
w2’ = w2 – η * 0.06 = 0.40 – 0.06 = 0.34
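The whole worked example can be reproduced in a short script (a sketch following the slides' steps, with η = 1):

```python
import math

# End-to-end sketch: forward pass, loss, backprop via the chain rule,
# and a gradient-descent weight update with η = 1.
x1, x2 = 0.50, 0.51
w1, w2 = 0.35, 0.40
y = 0.10
eta = 1.0

# Forward pass
s = w1 * x1 + w2 * x2            # Σ
y_hat = 1 / (1 + math.exp(-s))   # ŷ = g(Σ), sigmoid

# Loss: J = ½ * (y − ŷ)²
J = 0.5 * (y - y_hat) ** 2

# Chain rule: ∂J/∂w = ∂J/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w
dJ_dyhat = y_hat - y             # loss term
dyhat_ds = y_hat * (1 - y_hat)   # sigmoid derivative
dJ_dw1 = dJ_dyhat * dyhat_ds * x1
dJ_dw2 = dJ_dyhat * dyhat_ds * x2

# Gradient descent step
w1_new = w1 - eta * dJ_dw1
w2_new = w2 - eta * dJ_dw2
print(round(w1_new, 2), round(w2_new, 2))  # 0.29 0.34
```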
Gradient vanishing
∂J/∂W1 = ∂J/∂ŷ * ∂ŷ/∂Σ_N * ∂Σ_N/∂W_N * ... * ∂Z_k/∂Σ_k * ∂Σ_k/∂W_k * ... * ∂Z_1/∂Σ_1 * ∂Σ_1/∂W_1
What happens if we backpropagate on a network with many (N > k > 1) hidden layers?
Gradient vanishing
∂J/∂W1 = ∂J/∂ŷ * ∂ŷ/∂Σ_N * ∂Σ_N/∂W_N * ... * ∂Z_k/∂Σ_k * ∂Σ_k/∂W_k * ... * ∂Z_1/∂Σ_1 * ∂Σ_1/∂W_1 = O(10^(-N))
initial w1 = 0.5
optimal w1 = -0.2
5-layer gradient ~0.00001
How many iterations do we need to get from 0.5 to -0.2?
These are all “zero-point-somethings” multiplied by each other,
so the gradient becomes smaller by orders of magnitude as we go back
through more and more layers, until it’s so small that the network is stuck
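A rough numeric illustration of why this happens (a sketch; it uses only the fact that the sigmoid derivative never exceeds 0.25):

```python
# The sigmoid derivative g(Σ) * (1 − g(Σ)) is at most 0.25, so a chain of
# N such factors in the backpropagated gradient shrinks exponentially
# with depth, regardless of the weights' exact values.
max_sigmoid_deriv = 0.25
for n_layers in (1, 5, 10, 20):
    bound = max_sigmoid_deriv ** n_layers  # upper bound on the product of activation derivatives
    print(n_layers, bound)
```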